Multilingual document clustering : state of the art (Construction de corpus multilingues : état de l'art) [in French]

Author

  • Manuela Yapomo
Abstract

Multilingual corpora are extensively exploited in several branches of natural language processing. This paper presents an overview of work on the automatic construction of such corpora. We first survey the different notions of comparability, then examine the main approaches to similarity computation, corpus construction, and evaluation developed in the field. We observe that textual similarity is usually measured using corpus statistics, the structure of ontological resources, or a combination of the two. In a multilingual setting, relying on a multilingual dictionary or a machine translation system raises many problems; exploiting a multilingual ontological resource appears to be a worthwhile alternative. In clustering, the problem of adding documents to the initial collection without degrading cluster quality remains open.

Keywords: multilingual corpora, comparability, cross-lingual textual similarity, clustering.

Similar resources

State of the Art of Automatic Keyphrase Extraction Methods (État de l'art des méthodes d'extraction automatique de termes-clés) [in French]

This article presents the state of the art in automatic keyphrase extraction methods. The aim of automatic keyphrase extraction is to extract the most representative terms of a document. Automatic keyphrase extraction methods can be divided into two categories: supervised methods and unsupervised methods. For supervised me...

État de l'art : L'influence du domaine sur la classification de l'opinion (State of the Art : Influence of Domain on Opinion Classification) [in French]

Interest in opinion mining has grown alongside blogs, forums, and other platforms where Internet users can freely write their opinions on any topic. As the amount of available data keeps growing, the use of automatic methods for opinion mining becomes imperative. However, sentiment is expressed differe...

Second order similarity for exploring multilingual textual databases (Similarité de second ordre pour l'exploration de bases textuelles multilingues) [in French]

ABSTRACT This article describes the use of the second-order similarity technique to identify similar texts within a database of aviation incident reports that mixes French and English. The goal of the system is, for a given document, to retrieve documents with similar content regardless of their language. We use a bilingual corpus ali...

Construction of a Free Large Part-of-Speech Annotated Corpus in French (Construction d'un large corpus écrit libre annoté morpho-syntaxiquement en français) [in French]

ABSTRACT This article studies the possibility of creating a new morphosyntactically annotated written corpus in French from an existing annotated corpus. Our goals are to escape the restrictive usage license of the original corpus and to allow the texts to be continually modernized. We show that an automatically pre-annotated corpus can be used to train a tagger...

Publication date: 2013